library(Seurat)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#library(patchwork)
library(Matrix)

First we start by loading the data of the bone marrow (Homo sapiens) into our R environment. This data was obtain by using Illumina HiSeq 3000 and can be found here. The counts of the data are not normalised and the columns contain the cells while the rows have the genes in a sparse matrix (“sm”).

# we load our data in to the R environment 
bone.data <- load("SRA779509_SRS3805257.sparse.RData")
bone.data # check if the data is stored in sparsed matrix sm or sm2
## [1] "sm"
# we check the dimensions of the matrix.
# columns = cells
# rows = genes 
dim(sm)
## [1] 28556  8357
tail(rownames(sm))
## [1] "ZXDC_ENSG00000070476.14"   "ZYG11A_ENSG00000203995.9" 
## [3] "ZYG11B_ENSG00000162378.12" "ZYX_ENSG00000159840.15"   
## [5] "ZZEF1_ENSG00000074755.14"  "ZZZ3_ENSG00000036549.12"

Given that the names of the genes a long, what we can now do it to shorten the names of the genes so that keep only the first part of the name. This will help with easy identification and reading.

rownames(sm) <- sub(pattern="_E.*", replacement = "", x=rownames(sm), perl=TRUE)
tail(rownames(sm))
## [1] "ZXDC"   "ZYG11A" "ZYG11B" "ZYX"    "ZZEF1"  "ZZZ3"

Now we can now go on to create and have a look at the “Seurat object”. The Seurat object is a class which serves as a container for the storage and manipulation of single-cell data. It contains data (such as count matrix) and analysis (such as PCA).

# create the seurat object with some parameters
# min.cell =  minimum number of cells in which a gene can be detected
# min.features = minimum number of genes that have to be expressed in a cell
bone_mrw <- CreateSeuratObject(counts = sm, project = "Bone_marrow", min.cells = 3, min.features = 200)
## Warning: Non-unique features (rownames) present in the input matrix, making
## unique
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
bone_mrw
## An object of class Seurat 
## 22160 features across 5664 samples within 1 assay 
## Active assay: RNA (22160 features, 0 variable features)

By setting the “min.cells” and “min.features” parameter of the seurat object to 3 and 200 respectively, we retain 5664 cells (samples) and 22160 genes (features) from the initial 8357 cells and 28556 genes. The cells that did not meet our set parameters where discarded.

With the seurat object, we can easily perform quality control (QC) metrics and filter cells. In this case we perform or calculate mitochondrial QC metrics, which calculates the percentage of counts originating from a set of features. The number of unique genes and total molecules -in other words, the QC metrics- are automatically calculated when creating the seurat object by using “CreateSeuratObject” in the meta data (@meta.data). Therefore the mitochondrial QC metrics can be calculated and added as a column in the object metadata.

# we use the "PercentageFeatureSet" function to calculate the mitochondrial QC metrics
# We use the set of all genes starting with "MT-" as a set of mitochondrial genes.
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
bone_mrw[["percent.mt"]] <- PercentageFeatureSet(bone_mrw, pattern = "^MT-")
head(bone_mrw@meta.data)

We then visualise the QC metrics

VlnPlot(bone_mrw, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = 0)

# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.

plot1 <- FeatureScatter(bone_mrw, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(bone_mrw, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

plot1 + plot2

The plot shows us that the cells that have a low RNA count seem to have a high percentage of mitochrondrial RNA in 1st plot.

Next, we filter the data to retain the cells that have at least 200 expressed genes and not more than 3000 genes as well as <5% mitochrondrial counts.

bone_mrw <- subset(bone_mrw, subset = nFeature_RNA > 200 & nFeature_RNA < 3000 & percent.mt < 5)
bone_mrw
## An object of class Seurat 
## 22160 features across 4502 samples within 1 assay 
## Active assay: RNA (22160 features, 0 variable features)

Normalization of Data. After removing unwanted cells from the data set, the next step is to normalize the data. By default, Seurat normalizes the gene counts for each cell by the total counts for each cell, multiplies this by a scale factor (10,000 by default), and log-transforms the result. that is, it is log(read per ten thousands). Normalized values are stored in pbmc[[“RNA”]]@data.

bone_mrw <- NormalizeData(bone_mrw, normalization.method = "LogNormalize", scale.factor = 10000)

Since we want to compare the similarity between the cells at the transcriptional level we have to filter out the gene table and discard those that don’t give any useful information while keeping only the most variable genes. The choice is made on the relationship between SD, Var and mean of genes, keeping the top 2000 genes. This means that these 2000 genes are used for the rest of the analysis.

bone_mrw <- FindVariableFeatures(bone_mrw, selection.method = "vst", nfeatures = 4000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(bone_mrw), 10)

top10
##  [1] "IGHA1"  "IGHA2"  "IGKC"   "IGHG1"  "IGHG3"  "IGHG4"  "IGHGP"  "JCHAIN"
##  [9] "IGLC2"  "MZB1"
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(bone_mrw)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)
plot1
## Warning: Transformation introduced infinite values in continuous x-axis

plot2
## Warning: Transformation introduced infinite values in continuous x-axis

The next step is to scale the counts that came out after normaliazation. This means that it will be a transformation so that for each gene, its count will have mean 0 and Var 1 across all cells. This is a step of “biniarization” because at the end all the values will be 1,0 or -1 according to the fact that the gene is highly, average or lowly expressed in a given cell. Here the original counts are transformed into data matrix that is very close to something that is close to a binary value.

all.genes <- rownames(bone_mrw)
bone_mrw <- ScaleData(bone_mrw, features = all.genes)
## Centering and scaling data matrix

Performing the PCA

# performing the PCA on the variable features that were defined 
bone_mrw <- RunPCA(bone_mrw, features = VariableFeatures(object = bone_mrw))
## PC_ 1 
## Positive:  ACTB, CST3, CSTA, LST1, CTSS, TYROBP, FCN1, PTMA, AIF1, FCER1G 
##     VIM, S100A11, LGALS1, ACTG1, CLEC7A, FTL, AP1S2, SAT1, MNDA, HLA-DRA 
##     S100A6, LYZ, SRGN, KLF4, GRN, CYBA, SERPINA1, PSAP, COTL1, NPC2 
## Negative:  ALAS2, SNCA, SLC25A37, HBD, AHSP, FECH, BPGM, SLC25A39, SELENBP1, HBM 
##     EPB42, CA1, GMPR, DCAF12, BNIP3L, SLC4A1, DMTN, HEMGN, GYPB, IFIT1B 
##     GYPC, BLVRB, BSG, EIF1AY, MKRN1, FAM210B, PDZK1IP1, HBQ1, GYPA, NCOA4 
## PC_ 2 
## Positive:  CENPF, NUSAP1, HMGB2, BIRC5, CA2, PCLAF, TUBB, CDK1, GYPA, TOP2A 
##     MKI67, HIST1H4C, PRC1, CDKN3, PTTG1, PRDX2, RRM2, RHCE, SMC4, TPX2 
##     DLGAP5, RHAG, ZWINT, CENPE, H1F0, TUBG1, HMBS, TYMS, CCNA2, CCNB2 
## Negative:  LTB, JUN, TRAC, ZFP36, IL32, TRBC2, ANXA1, CD7, SRGN, IFITM2 
##     DUSP2, AC245427.1, CD3G, IFITM3, IL7R, COTL1, AC058791.1, S100A10, KLF6, IER2 
##     ITGB2, CYBA, KLRB1, C1orf162, CD27, CCL5, C16orf54, AC016831.5, CTSW, ADGRE5 
## PC_ 3 
## Positive:  RPLP0, PTMA, HNRNPA1, NPM1, GAS5, JUN, LTB, HMGN1, CD79B, RPS2 
##     VPREB3, CD24, LRRC75A-AS1, IGLL1, PPIA, IGHM, CD79A, TRAC, VPREB1, HMGB1 
##     STMN1, MIR1244-2, EBF1, SOX4, TRBC2, TCL1A, IL32, CD7, ACTG1, ZFAS1 
## Negative:  S100A9, FCN1, S100A8, LGALS3, BLVRB, CSTA, CXCL8, S100A12, CLEC7A, SLC25A37 
##     SNCA, HBD, CST3, MNDA, ALAS2, GPX1, CD14, SERPINA1, BNIP3L, VCAN 
##     AHSP, LYZ, LST1, HBM, LGALS2, CFD, FECH, SLC4A1, FGL2, SELENBP1 
## PC_ 4 
## Positive:  IGHM, CD79B, VPREB3, HLA-DRA, CD79A, CD24, TCL1A, SOX4, FAM129C, CD74 
##     CD9, VPREB1, PCDH9, EBF1, TCF4, HLA-DMA, IGLL1, HLA-DRB1, RUBCNL, FCRLA 
##     CCDC191, GNG7, IGHD, HLA-DPB1, BCL7A, CD72, HLA-DPA1, SPIB, MZB1, HRK 
## Negative:  NKG7, CTSW, GZMA, CST7, CCL5, CD7, S100A4, KLRB1, KLRD1, GNLY 
##     PRF1, GZMH, FGFBP2, CD247, HOPX, GZMB, CCL4, SPON2, ITGB2, TRDC 
##     AC245427.1, KLRF1, S100A6, IL32, CLIC3, SRGN, ID2, IFITM3, IFITM2, ANXA1 
## PC_ 5 
## Positive:  LRRC75A-AS1, LTB, TRAC, GAS5, RPLP0, IL7R, RPS2, NPM1, CD3G, PRKCQ-AS1 
##     SERINC5, VIM, ZFAS1, HMGA1, IMPDH2, CD27, HSPD1, TMEM14C, DUT, MYC 
##     FAM178B, BEX3, COTL1, S100A6, HSPB1, PCLAF, AP3M2, AC058791.1, IL32, APOC1 
## Negative:  NKG7, GNLY, GZMB, FGFBP2, KLRD1, PRF1, GZMH, SPON2, CST7, KLRF1 
##     CLIC3, CCL4, GZMA, HOPX, TRDC, CCL5, CTSW, FCGR3A, PRSS23, XCL2 
##     TYROBP, FCER1G, IL2RB, ADGRG1, S1PR5, MATK, TTC38, RHOC, C1orf21, C12orf75
VizDimLoadings(bone_mrw, dims = 1:2, reduction = "pca")

# here we can visualize the result of the PCA
DimPlot(bone_mrw, reduction = "pca")

DimPlot(bone_mrw, reduction = "pca", dims = c(3,4))

We use the heatmap to visualize the expression of the most variable genes on the most variable cells in any PC selected.

DimHeatmap(bone_mrw, dims = 1, cells = 500, balanced = TRUE)

DimHeatmap(bone_mrw, dims = 1:9, cells = 500, balanced = TRUE)

We use the “ElbowPlot” function as a way to determine the number of PCs. This plot shows the standard deviation of each PC.

ElbowPlot(bone_mrw, ndims = 30)

The next step is the clustering of the cells.

bone_mrw<- FindNeighbors(bone_mrw, dims=1:15, reduction = "pca") # computes the nearest neighbor graph
## Computing nearest neighbor graph
## Computing SNN
bone_mrw<- FindClusters(bone_mrw, resolution = 0.5, algorithm = 1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 4502
## Number of edges: 167051
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9185
## Number of communities: 15
## Elapsed time: 1 seconds
DimPlot(bone_mrw,reduction = "pca") # this labels the PCA plot with clusters. Each cluster gets a unique colour

bone_mrw <- RunUMAP(bone_mrw, dims = 1:15)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 11:22:51 UMAP embedding parameters a = 0.9922 b = 1.112
## 11:22:51 Read 4502 rows and found 15 numeric columns
## 11:22:51 Using Annoy for neighbor search, n_neighbors = 30
## 11:22:51 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 11:22:54 Writing NN index file to temp file /tmp/RtmpWDClLD/file4226164f036c
## 11:22:54 Searching Annoy index using 1 thread, search_k = 3000
## 11:22:59 Annoy recall = 100%
## 11:22:59 Commencing smooth kNN distance calibration using 1 thread
## 11:23:01 Initializing from normalized Laplacian + noise
## 11:23:03 Commencing optimization for 500 epochs, with 187986 positive edges
## 11:23:27 Optimization finished
DimPlot(bone_mrw,reduction = "umap") # visualise the clusters 

bone_mrw<- FindNeighbors(bone_mrw, dims=1:19, reduction = "pca")
## Computing nearest neighbor graph
## Computing SNN
bone_mrw<- FindClusters(bone_mrw, resolution = 0.5, algorithm = 1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 4502
## Number of edges: 176229
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9251
## Number of communities: 16
## Elapsed time: 1 seconds
DimPlot(bone_mrw,reduction = "pca")

bone_mrw <- RunUMAP(bone_mrw, dims = 1:19)
## 11:23:41 UMAP embedding parameters a = 0.9922 b = 1.112
## 11:23:41 Read 4502 rows and found 19 numeric columns
## 11:23:41 Using Annoy for neighbor search, n_neighbors = 30
## 11:23:41 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 11:23:44 Writing NN index file to temp file /tmp/RtmpWDClLD/file4226e81e8a6
## 11:23:44 Searching Annoy index using 1 thread, search_k = 3000
## 11:23:49 Annoy recall = 100%
## 11:23:50 Commencing smooth kNN distance calibration using 1 thread
## 11:23:51 Initializing from normalized Laplacian + noise
## 11:23:53 Commencing optimization for 500 epochs, with 188300 positive edges
## 11:24:15 Optimization finished
DimPlot(bone_mrw,reduction = "umap")

bone_mrw<- FindNeighbors(bone_mrw, dims=1:20, reduction = "pca")
## Computing nearest neighbor graph
## Computing SNN
bone_mrw<- FindClusters(bone_mrw, resolution = 0.5, algorithm = 1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 4502
## Number of edges: 178762
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9253
## Number of communities: 15
## Elapsed time: 2 seconds
DimPlot(bone_mrw,reduction = "pca")

bone_mrw <- RunUMAP(bone_mrw, dims = 1:20)
## 11:24:33 UMAP embedding parameters a = 0.9922 b = 1.112
## 11:24:33 Read 4502 rows and found 20 numeric columns
## 11:24:33 Using Annoy for neighbor search, n_neighbors = 30
## 11:24:33 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 11:24:36 Writing NN index file to temp file /tmp/RtmpWDClLD/file422610a8eb52
## 11:24:36 Searching Annoy index using 1 thread, search_k = 3000
## 11:24:43 Annoy recall = 100%
## 11:24:45 Commencing smooth kNN distance calibration using 1 thread
## 11:24:48 Initializing from normalized Laplacian + noise
## 11:24:50 Commencing optimization for 500 epochs, with 188798 positive edges
## 11:25:13 Optimization finished
DimPlot(bone_mrw,reduction = "umap")

#saveRDS(bone_mrw, file = "./bone_marrow.rds")

Find the marker genes of all the clusters

bone_mrw.markers <- FindAllMarkers(bone_mrw, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## For a more efficient implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the limma package
## --------------------------------------------
## install.packages('BiocManager')
## BiocManager::install('limma')
## --------------------------------------------
## After installation of limma, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
## Calculating cluster 13
## Calculating cluster 14
head(bone_mrw.markers)

The marker genes are then grouped by clusters and the top 2 genes of each cluster are used to identify the cells type of the cluster.

bone_mrw.markers %>% group_by(cluster) %>% top_n(2, avg_logFC)
# we get the number of cells in each cluster 
cell.num <- table(Idents(bone_mrw))
cell.num
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
## 927 693 654 317 317 292 234 224 215 196 118 108  99  83  25

Showing the expression level of some of the marker genes

# showing the expression level of the genes  of the cluster attributed to Germ cells
VlnPlot(bone_mrw, features= c("IGLL1","DNTT", "STMN1"))

VlnPlot(bone_mrw, features = c("HBA1", "GAS5", "IL32"))

VlnPlot(bone_mrw, features = c("S100A8", "CCL5", "TNFRSF13C") )

VlnPlot(bone_mrw, features = c("PCLAF", "NKG7", "IGHA2"))

VlnPlot(bone_mrw, features = c("IRF8", "JCHAIN"))

FeaturePlot showing the expression of the selected marker genes across all clusters

FeaturePlot(bone_mrw, features = c("IGLL1","DNTT", "STMN1"))

FeaturePlot(bone_mrw, features = c("HBA1", "GAS5", "IL32"))

FeaturePlot(bone_mrw, features = c("S100A8", "CCL5", "TNFRSF13C", "IRF8", "JCHAIN", "IRF8", "JCHAIN"))

We use the “DoHeatmap” function to generate the expression heatmap of the top 3 marker of each cluster.

top4 <- bone_mrw.markers %>% group_by(cluster) %>% top_n(n = 3, wt = avg_logFC)
DoHeatmap(bone_mrw, features = top4$gene) + NoLegend()

Now we added the name of the cell types to each cluster

new.cluster.ids <- c("Erythroid-like", "Unknown", "T cells", "Dentrictic cells", "Erythroid-like", "Gamma delta T cells", "B cells", "Unknown", "Erythroid-like", "Gamma delta T cells", "B cells", "Germ cells", "Plasma cells", "Unknown", "Dentrictic cells")
names(new.cluster.ids) <- levels(bone_mrw)
bone_mrw <- RenameIdents(bone_mrw, new.cluster.ids)
DimPlot(bone_mrw, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()